Abstract

In this paper, we propose a novel unsupervised deep learning
model, called the PCA-based Convolutional Network (PCN). The
architecture of PCN is composed of several feature extraction
stages and a nonlinear output stage. Each feature extraction
stage includes two layers: a convolutional layer and a feature
pooling layer. In the convolutional layer, the filter banks are
simply learned by PCA. For the higher convolutional layers, the
filter banks are learned from the feature maps obtained in the
previous stage. In the nonlinear output stage, binary hashing is
applied. To test PCN, we conducted extensive experiments on
several challenging tasks, including handwritten digit
recognition, face recognition and texture classification. The
results show that PCN performs competitively with, or even better
than, state-of-the-art deep learning models. More importantly,
since there is no back-propagation for supervised fine-tuning,
PCN is much more efficient than existing deep networks.


1 Introduction 

Traditional models for classification tasks are generally composed
of hand-crafted feature extraction and a trainable classifier. The
most popular hand-crafted features include Gabor features [Tao et
al., 2007], local binary patterns (LBP) [Guo and Zhang, 2010], HOG
[Onishi et al., 2008] and SIFT [Ke and Sukthankar, 2004]. They
have been successfully applied in texture classification, face
recognition and object recognition tasks. However, features
extracted by hand-crafted methods are typically low-level and are
tailored, using prior knowledge, to specific data and tasks.

Recently, deep learning has become a popular way of automatically
learning features from data that disentangle the underlying
factors of variation. The proposed deep approaches usually stack
feature extractors layer by layer. For example, deep belief
networks are built by stacking pre-trained restricted Boltzmann
machines (RBMs), and deep auto-encoders are built by stacking RBMs
or auto-encoders. Deep architectures learn more hierarchical and
more abstract features at the higher layers of the representation.


One of the most powerful deep architectures is a biologically
inspired model, the convolutional network (ConvNet). ConvNets are
a trainable multi-stage architecture in which each stage is
composed of three layers: a filter bank layer, a nonlinearity
layer and a feature pooling layer. Weight sharing in the
convolution layer and the pooling operation are the key elements
of ConvNets; they lead to features that are invariant to small
variations. A deep ConvNet with a multi-stage architecture can
learn hierarchical features, from local low-level features to
global high-level ones. However, training such a deep network
typically uses gradient descent in a supervised mode, which
requires a large number of labeled samples. In addition, good
results sometimes depend on tricks of the trade for parameter
tuning, e.g. using "dropout" for regularization [Hinton et al.,
2012].


Recent research has shown that using unsupervised learning in each
stage of a ConvNet significantly reduces the amount of labeled
data required. PCANet is such a variation of deep convolutional
networks, in which the convolution filter banks in each stage are
simply chosen from PCA filters [Chan et al., 2014]. Surprisingly,
when such simple filters are used in a deep network architecture,
the network demonstrates performance competitive with other deep
networks. However, PCANet dispenses with the pooling layer in the
feature learning stage and only uses block-wise histograms
together with a nonlinear operation in the output stage. This
results in exponentially growing feature dimensions and training
time as the number of samples increases.


In this paper, we propose a convolutional architecture in which
the filters are learned by PCA in an unsupervised mode. The
network is composed of a feature extraction stage, which can be
stacked into multiple stages, and a nonlinearity stage. The
feature extraction stage includes a convolution layer and a
pooling layer and can be easily cascaded into a deep architecture.
The nonlinearity stage includes binary hashing and histogram
statistics; its output is then fed into a trainable classifier.
The filter bank in each convolution layer is learned by PCA, and
the generated feature maps are aggregated by the pooling layer.
This results in multiple sets of feature maps corresponding to
different filters, which probably detect distinctive features of
the input (e.g. features at similar orientations). The filter
banks in the higher convolution layer are computed from
combinations of multiple sets of feature maps. This is inspired by
the intuition that high-level features are combinations and
abstractions of low-level features. The multiple feature maps
corresponding to an input represent different features extracted
from that same input. Experiments show performance comparable to
state-of-the-art approaches in classification tasks.


2 Related Work 


In the past few years, variations of the convolutional network
have been proposed with respect to the pooling and convolution
operations. Recently, unsupervised learning has been used to
pre-train each stage, which alleviates the need for labeled data.
Once all the stages are pre-trained, the network is fine-tuned
using stochastic gradient descent. Many methods have been proposed
to pre-train the filter banks of convolution layers in an
unsupervised feature learning mode. Convolutional versions of
sparse RBMs [Jarrett et al., 2009; Lee et al., 2009a], sparse
coding [Bruna and Mallat, 2013] and predictive sparse
decomposition (PSD) [Jarrett et al., 2009; Henaff et al., 2011;
Kavukcuoglu et al., 2009; Kavukcuoglu et al., 2010] were reported
to achieve high accuracy on several benchmarks.

Alternatively, some networks similar to ConvNets use pre-fixed
filters in the convolution layer and still yield good performance
on several benchmarks. In [Serre et al., 2005; Mutch and Lowe,
2006], Gabor filters were used in the first convolution layer.
Wavelet scattering networks (ScatNet) [Bruna and Mallat, 2013;
Sifre and Mallat, 2013] also use pre-fixed convolutional filters,
which are called scattering operators. Using a multi-level
architecture similar to ConvNets, ScatNet achieved impressive
results in handwritten digit and texture recognition. A more
closely related work is PCANet [Chan et al., 2014], which simply
uses PCA filters, learned in an unsupervised mode, in the
convolution layer. On top of multiple convolution layers, a
nonlinear output stage is applied with hashing and block-wise
histograms. Just a few cascaded convolution layers were shown to
achieve new records in several challenging vision tasks, such as
face and handwritten digit recognition, and comparable results on
texture classification and object recognition.


Suppose we are given $N$ input images, denoted as
$\{I_i\}_{i=1}^{N}$; the size of each input image is $m \times n$.
The filter size used in each stage is $k_1 \times k_2$. In the
following we describe each stage of PCN in detail.


3.1 The first feature extraction stage in PCN 


Inspired by the weight sharing of receptive fields in ConvNets
[Jarrett et al., 2009], for each input image we sample patches of
size $k_1 \times k_2$ at every $k$ pixel locations, i.e. the
sampling interval is $k$ pixels. Each patch is vectorized to form
a column with $k_1 k_2$ elements. Then all patches sampled from
the same input image are put together to form a matrix of size
$(k_1 k_2) \times ((\lceil \frac{m-k_1}{k} \rceil + 1)(\lceil \frac{n-k_2}{k} \rceil + 1))$,
denoted as

$$X_i = [x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,(\lceil \frac{m-k_1}{k} \rceil + 1)(\lceil \frac{n-k_2}{k} \rceil + 1)}] \in \mathbb{R}^{(k_1 k_2) \times ((\lceil \frac{m-k_1}{k} \rceil + 1)(\lceil \frac{n-k_2}{k} \rceil + 1))},$$

where $x_{i,j}$ represents the vector of the $j$th patch in $I_i$.

In order to introduce competition between adjacent features within
a neighbourhood, the mean value of the corresponding patch is
subtracted from each column vector of $X_i$ to obtain the matrix

$$\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \bar{x}_{i,3}, \ldots, \bar{x}_{i,(\lceil \frac{m-k_1}{k} \rceil + 1)(\lceil \frac{n-k_2}{k} \rceil + 1)}].$$

This operation is reminiscent of the local contrast normalization
used in the ImageNet model of [Krizhevsky et al., 2012]. Once the
matrices for all input images are constructed in the same way,
they are assembled to form a large matrix
$X = [\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots, \bar{X}_N]$.
Subsequently, the mean of each row of $X$ is subtracted from that
row; the result is also denoted as $X$. Eigenvalue decomposition
is then performed on the matrix $XX^T$. The convolutional filters
are selected as the first $L_1$ principal eigenvectors of $XX^T$.
Thus, the learned filters can be described as

$$W_l = \mathrm{mat}_{k_1,k_2}(q_l(XX^T)) \in \mathbb{R}^{k_1 \times k_2}, \quad l = 1, 2, 3, \ldots, L_1,$$

where $\mathrm{mat}_{k_1,k_2}(v)$ denotes the mapping from a vector
$v$ to a matrix $W \in \mathbb{R}^{k_1 \times k_2}$, and
$q_l(XX^T)$ represents the $l$th eigenvector of the matrix $XX^T$.
The eigenvectors are reshaped to the size $k_1 \times k_2$. In this
way, we obtain $L_1$ filters of size $k_1 \times k_2$. We
subsequently convolve the learned filters with the input images to
generate filter responses at each pixel location; we call the
filtering results feature maps.


$$I_i^l = I_i * W_l, \quad i = 1, 2, 3, \ldots, N \qquad (1)$$
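The filter-learning procedure of this subsection can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `extract_patches` and `learn_pca_filters` are hypothetical, only patches that lie fully inside the image are sampled (an assumption about the boundary handling), and patches are vectorized in row-major order.

```python
import numpy as np

def extract_patches(image, k1, k2, stride=1):
    """Collect k1 x k2 patches on a regular grid; each patch becomes one column."""
    m, n = image.shape
    cols = [
        image[r:r + k1, c:c + k2].reshape(-1)
        for r in range(0, m - k1 + 1, stride)
        for c in range(0, n - k2 + 1, stride)
    ]
    return np.stack(cols, axis=1)  # shape: (k1 * k2, number_of_patches)

def learn_pca_filters(images, k1, k2, num_filters, stride=1):
    """Return num_filters filters of size k1 x k2 learned by PCA on image patches."""
    blocks = []
    for image in images:
        patches = extract_patches(np.asarray(image, dtype=float), k1, k2, stride)
        # Remove the mean of each patch (column-wise mean).
        blocks.append(patches - patches.mean(axis=0, keepdims=True))
    X = np.concatenate(blocks, axis=1)
    X = X - X.mean(axis=1, keepdims=True)  # remove the mean of each row
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_filters]  # largest eigenvalues first
    return eigvecs[:, order].T.reshape(num_filters, k1, k2)

# Toy usage on random data (illustrative only).
rng = np.random.default_rng(0)
images = rng.random((4, 12, 12))
filters = learn_pca_filters(images, k1=3, k2=3, num_filters=5)
```

Because the patch mean is removed before the decomposition, each learned filter in this sketch sums to approximately zero, so it responds to local contrast rather than to absolute intensity.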


3 The PCA-Based Convolutional Network 

The PCN is essentially a multi-stage convolutional network that
can be trained layer-wise in an unsupervised manner. It is
composed of cascaded feature extraction stages and a nonlinear
output stage. Figure 1 illustrates the structure of a typical PCN
with three stages, including the output stage. Each feature
extraction stage consists of a convolutional layer and a pooling
layer. The inputs are first convolved with PCA-based filters to
produce a set of feature maps. The pooling layer computes the
average or maximum value over a neighborhood. The purpose of the
pooling layer is to build robustness to small distortions and to
reduce the resolution of the feature maps by a factor $p$
horizontally and $q$ vertically. The feature maps propagated
through the pooling layer are then fed into the next stage as
input. The final output stage of PCN comprises binary hashing and
block-wise histogram statistics.


where $*$ is the 2D convolution operation; $I_i$ is padded with
zeros before convolution.
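As an illustration of the zero-padded convolution, the following NumPy sketch computes a feature map of the same size as its input. The name `conv2d_same` and the centered placement of the padding are assumptions made for this example; the text only states that the image is padded with zeros before convolution.

```python
import numpy as np

def conv2d_same(image, kernel):
    """2D convolution with zero padding so the output has the input's size."""
    k1, k2 = kernel.shape
    top, left = (k1 - 1) // 2, (k2 - 1) // 2
    # Zero-pad so that every output pixel sees a full k1 x k2 window.
    padded = np.pad(image, ((top, k1 - 1 - top), (left, k2 - 1 - left)))
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out = np.empty(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + k1, c:c + k2] * flipped)
    return out

# Toy usage: a kernel with a single one at its center leaves the image unchanged.
image = np.arange(16, dtype=float).reshape(4, 4)
delta = np.zeros((3, 3))
delta[1, 1] = 1.0
same = conv2d_same(image, delta)
```

The explicit double loop keeps the sketch dependency-free; an optimized library routine would normally replace it in practice.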

The convolution with each input image produces $L_1$ feature maps.
Each feature map represents particular features extracted at the
corresponding locations in the image. We divide the feature maps
(padded with zeros) generated by the convolutional layer into
several non-overlapping pooling regions of size $p \times q$. Then
max pooling or average pooling is applied to the pooling regions.
The pooling operation results in feature maps with reduced
resolution, and the pooled features are robust to small
distortions. We use $S_i^l$ to represent the pooling result of the
$l$th feature map of the $i$th input image. Given a collection
$\{I_i\}_{i=1}^{N}$ of $N$ input images, through the convolution
and pooling operations using the $l$th filter we obtain $N$
feature maps, which are denoted as $\{S_i^l\}_{i=1}^{N}$,
$l = 1, 2, 3, \ldots, L_1$. Since there are $L_1$ filters in the
first extraction stage, we obtain $NL_1$ feature maps in total.
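The pooling step can be sketched as below. This is a hedged example: `pool2d` and its parameter names are hypothetical, the feature map is zero-padded at the bottom and right so that its size becomes a multiple of the pooling region, and the assignment of the region's two dimensions to rows and columns is an assumption.

```python
import numpy as np

def pool2d(feature_map, region_rows, region_cols, mode="max"):
    """Pool over non-overlapping regions of size region_rows x region_cols."""
    m, n = feature_map.shape
    pad_r = (-m) % region_rows  # rows needed to reach a multiple of region_rows
    pad_c = (-n) % region_cols
    padded = np.pad(feature_map, ((0, pad_r), (0, pad_c)))  # zero padding
    blocks = padded.reshape(
        padded.shape[0] // region_rows, region_rows,
        padded.shape[1] // region_cols, region_cols,
    )
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")

# Toy usage: a 5 x 5 map pooled with 2 x 2 regions becomes 3 x 3.
fmap = np.arange(1, 26, dtype=float).reshape(5, 5)
pooled_max = pool2d(fmap, 2, 2, mode="max")
pooled_avg = pool2d(fmap, 2, 2, mode="average")
```

Note that with zero padding, border regions of a map with negative responses would report zero under max pooling; this is a property of the sketch, not a claim about the original code.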














































Figure 1: The detailed block diagram of the proposed (three-stage) PCN. 



Figure 2: Basic structure of the proposed (three-stage) PCN. 

3.2 The second feature extraction stage in PCN 

The pooled feature maps of the first stage are treated as the
original input to the second stage. These $NL_1$ feature maps are
divided into $L_1$ subsets. Each subset includes $N$ feature maps,
produced by convolving the input images with the same filter in
the previous stage, and is denoted as
$S^l = \{S_i^l\}_{i=1}^{N}$, $l = 1, 2, 3, \ldots, L_1$. Feature
maps in one subset capture certain features of the input images,
whereas those in different subsets capture different types of
features. Figure 2 shows the structure of the proposed PCN.

Since high-level features are combinations and abstractions of
low-level features [Gutmann and Hyvarinen, 2013], we combine the
subsets $\{S^l\}_{l=1}^{L_1}$ according to a certain rule to form
several groups. Table 1 demonstrates one way to combine the
subsets. In each group (corresponding to a column in the table),
the feature maps (marked with x) corresponding to the same input
image are added. The combined subsets are then used as the actual
input to the second feature extraction stage.

Table 1: An example of combination ways. The first layer consists
of 5 filters; two adjacent subsets of feature maps are combined.

In Table 1 each row represents a subset of feature maps obtained
from the previous stage, and each column represents a group
combining these subsets in a particular way. There are 5 filters
in the first stage, which result in 5 subsets. Two subsets are
added in each group, and 5 groups are formed. In the table, x
indicates that the corresponding subsets are combined to form one
group. In practice, an indexing matrix is used to define the way
of combination. In the indexing matrix, most entries are zeros and
a few entries are ones; the ones indicate the subsets belonging to
one group. Thus, different indexing matrices can be defined. If
the indexing matrix is defined as an identity matrix, there is no
combination of subsets.
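The role of the indexing matrix can be illustrated with a short NumPy sketch. The helper names `combine_subsets` and `adjacent_pair_index` are hypothetical, and the wrap-around pairing of the last subset with the first is only one assumed way to obtain 5 groups from 5 subsets; the exact pattern of Table 1 is not reproduced here.

```python
import numpy as np

def combine_subsets(subsets, index_matrix):
    """Add feature maps of the subsets selected by each column of index_matrix.

    subsets:      array of shape (L1, N, h, w), one subset per first-stage filter
    index_matrix: 0/1 array of shape (L1, G); column g lists the subsets in group g
    returns:      array of shape (G, N, h, w)
    """
    subsets = np.asarray(subsets, dtype=float)
    index_matrix = np.asarray(index_matrix, dtype=float)
    # For every group g and image n, sum the maps of the selected subsets.
    return np.einsum("lg,lnhw->gnhw", index_matrix, subsets)

def adjacent_pair_index(num_subsets):
    """Hypothetical indexing matrix: group g combines subsets g and g + 1 (wrapping)."""
    index = np.eye(num_subsets, dtype=int)
    for g in range(num_subsets):
        index[(g + 1) % num_subsets, g] = 1
    return index

# Toy usage with L1 = 5 subsets of N = 2 feature maps each.
subsets = np.arange(5 * 2 * 3 * 3, dtype=float).reshape(5, 2, 3, 3)
unchanged = combine_subsets(subsets, np.eye(5))
groups = combine_subsets(subsets, adjacent_pair_index(5))
```

With an identity indexing matrix the subsets pass through unchanged, which corresponds to the no-combination case described above.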

The combination produces several new subsets, and each new subset
is denoted as $\{S_i^{l'}\}_{i=1}^{N}$, which also consists of $N$
feature maps.

By repeating the same procedure as in the first stage, for each
$\{S_i^{l'}\}_{i=1}^{N}$ we sample patches from each feature map
in this subset. Then we also subtract the patch mean values and
join all vectors together to form a matrix denoted as

$$\bar{Y}_i^{l'} = [\bar{y}_{i,l',1}, \bar{y}_{i,l',2}, \bar{y}_{i,l',3}, \ldots, \bar{y}_{i,l',(\lceil \frac{m-k_1}{k} \rceil + 1)(\lceil \frac{n-k_2}{k} \rceil + 1)}],$$

where $\bar{y}_{i,l',j}$ represents the mean-removed vector of the
$j$th patch of the $i$th feature map in the $l'$th subset. We
further collect patches from all the feature maps in this subset,
remove the patch mean, and concatenate the matrices
$\bar{Y}_i^{l'}$ as
$Y^{l'} = [\bar{Y}_1^{l'}, \bar{Y}_2^{l'}, \bar{Y}_3^{l'}, \ldots, \bar{Y}_N^{l'}]$.
Afterwards the row mean is removed from $Y^{l'}$. Since there are
$L_1$ subsets, we obtain $L_1$ such matrices $Y^{l'}$,
$l' = 1, 2, 3, \ldots, L_1$.

For each subset $\{S_i^{l'}\}_{i=1}^{N}$, we construct filters
separately using the following equation:

$$V_l^{l'} = \mathrm{mat}_{k_1,k_2}(q_l(Y^{l'} Y^{l'T})) \in \mathbb{R}^{k_1 \times k_2}, \quad l = 1, 2, 3, \ldots, L_2 \qquad (2)$$



















































































































For each subset, we choose the first $L_2$ principal eigenvectors
as PCA filters, denoted as $\{V_l^{l'}\}_{l=1}^{L_2}$. Each input
feature map in this subset is convolved with the $L_2$ filters,
which results in $L_2$ new feature maps. Since there are $L_1$
subsets (produced by $L_1$ groups), we produce $L_1 N L_2$ feature
maps in total in the second stage, and they are the output of the
second feature extraction stage (C2 in Figure 2).

The pooling process in the second stage is the same as in the
first stage. The output feature maps of C2 are divided into
several non-overlapping patches of size $p \times q$, and the
maximum or average value is calculated over each pooling region.

If there are more feature extraction stages, the process is 
repeated in the same way as described above. 


3.3 The output stage in PCN 


In the output stage we reconstruct the feature maps to form the
final representation of the input image. We use binary hashing and
histogram statistics (called "hashingHist") as in PCANet [Chan et
al., 2014]. Each input feature map $S_i^{l'}$ to the second stage
produces $L_2$ output maps. We binarize these output maps by
calculating $H(S_i^{l'} * V_l^{l'})$, where $H(\cdot)$ is a
Heaviside step function whose value is one for positive entries
and zero otherwise. For each pixel location, we treat the vector
of $L_2$ binary bits as a decimal number. This converts the $L_2$
outputs generated in the second stage back into a single
integer-valued image.
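A minimal NumPy sketch of the binary hashing step is given below. The function name `binary_hash` is hypothetical, and the bit ordering (the first response map is taken as the least significant bit, as in PCANet's weighting by powers of two) is an assumption, since the text does not fix it.

```python
import numpy as np

def binary_hash(responses):
    """Merge L2 filter responses of shape (L2, h, w) into one integer-valued image."""
    # Heaviside step: one for positive entries, zero otherwise.
    bits = (np.asarray(responses) > 0).astype(np.int64)
    num_maps = bits.shape[0]
    # Assumed bit order: map l contributes 2**l (l = 0 is the least significant bit).
    weights = 2 ** np.arange(num_maps, dtype=np.int64)
    return np.tensordot(weights, bits, axes=1)  # values in [0, 2**L2 - 1]

# Toy usage with L2 = 3 response maps of size 1 x 2.
responses = np.array([
    [[1.0, -1.0]],
    [[-1.0, 2.0]],
    [[3.0, 0.0]],
])
hashed = binary_hash(responses)
```

In this toy input the first pixel has bits (1, 0, 1) and the second has bits (0, 1, 0), so the resulting integer image encodes which of the three filters responded positively at each location.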

For each of the $L_1$ integer-valued images, we partition it into
$B$ blocks. We compute the histogram of the decimal values in each
block, and concatenate all the $B$ histograms into one vector.
After this encoding process, the feature of the input image $I_i$
becomes the set of block-wise histograms. The local blocks can be
either overlapping or non-overlapping, depending on the
application.
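The block-wise histogram encoding can be sketched as follows. The name `blockwise_histogram` and its stride parameters are hypothetical; a stride equal to the block size gives non-overlapping blocks, a smaller stride gives overlapping ones, and blocks that would extend past the image border are simply skipped in this sketch.

```python
import numpy as np

def blockwise_histogram(hashed, block_rows, block_cols, num_bins,
                        stride_rows=None, stride_cols=None):
    """Concatenate histograms of an integer-valued image computed over blocks."""
    stride_rows = stride_rows or block_rows  # default: non-overlapping blocks
    stride_cols = stride_cols or block_cols
    h, w = hashed.shape
    histograms = []
    for r in range(0, h - block_rows + 1, stride_rows):
        for c in range(0, w - block_cols + 1, stride_cols):
            block = hashed[r:r + block_rows, c:c + block_cols]
            # One bin per possible hashed value; num_bins would be 2**L2.
            histograms.append(np.bincount(block.ravel(), minlength=num_bins))
    return np.concatenate(histograms)

# Toy usage: values in [0, 3], i.e. L2 = 2 and num_bins = 4.
hashed_image = np.array([[0, 1, 2, 3],
                         [0, 1, 2, 3]])
feature = blockwise_histogram(hashed_image, 2, 2, num_bins=4)
overlapping = blockwise_histogram(hashed_image, 2, 2, num_bins=4, stride_cols=1)
```

The full feature vector of an input image would then be obtained by concatenating such vectors over all $L_1$ integer-valued images.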


Table 2: Comparison of digit recognition rates (%) of different
methods on basic MNIST.

Method       Accuracy
PCANet-2     98.94
CAE-2        97.52
ScatNet-2    98.73
PCN-2        99.20


Table 3: Comparison of digit recognition rates (%) of different
methods on standard MNIST.

Methods                                        Accuracy
HSC [Yu et al., 2011]                          99.23
K-NN-SCM [Belongie et al., 2002]               99.37
K-NN-IDM [Keysers et al., 2007]                99.46
CDBN [Lee et al., 2009b]                       99.18
ConvNet [Jarrett et al., 2009]                 99.47
ScatNet-2 (SVM rbf) [Bruna and Mallat, 2013]   99.57
PCANet-2                                       99.34
PCN-2                                          99.41


Digit recognition on the basic MNIST Dataset 

The basic MNIST dataset is a smaller subset of MNIST. It contains
10000 training images, 2000 validation images and 50000 testing
images. We first perform our experiment on the basic dataset. The
hyper-parameters were selected to maximize the performance on the
validation set. Then, the system was trained over the entire
training set and validation set. We achieve the highest accuracy
of 99.20% when the numbers of filters in the first and second
stages are set to 6 and 11 respectively. This is higher than the
related methods listed in Table 2.


4 Experiments 

In all experiments, for simplicity, a three-stage PCN (including
the final output stage) is applied to the different data sets. The
final output features of the PCN are sent to a linear SVM for
classification. These configurations are kept fixed. We compared
the efficiency of PCN on the different recognition tasks using the
same desktop PC with an Intel i5-3570 CPU and 32GB of memory.

4.1 Digit Recognition based on the MNIST Datasets

Because the images in the MNIST datasets are small, we set the
patch sampling interval to 1, i.e. we sample a patch at each pixel
location. The patch size is set to 7 × 7. In the output stage, we
set the block size to 7 × 7 and the block overlapping ratio to
0.5. These three parameters are kept unchanged during the
experiment. The pooling layer is disabled in every feature
extraction stage; this is controlled by a parameter in our code.
We select an identity matrix as the indexing matrix, that is,
every group in the second stage contains only one subset.


Digit recognition on the standard MNIST Dataset 

The standard MNIST dataset consists of 60000 training images and
10000 testing images. To adjust the hyper-parameters, a validation
set of 5 samples per class was taken out of the training set. The
hyper-parameters were selected to maximize the performance on the
validation set. Then, the system was trained over the entire
training set. We found the best configuration when the numbers of
filters in the first and second stages were set to 8 and 10
respectively. The accuracy reached 99.41%, which is higher than
the 99.34% of PCANet-2 and close to the best results listed in
Table 3. Overall, PCN achieves performance competitive with the
state of the art, but with much less computation due to its simple
network structure.


Table 4: Face recognition rates (%) and time consumption (s) on
Extended Yale B.

Methods             PCANet-2    PCN-2
Training Time       8551.75     2054.42
Test Time/Sample    1.39        0.27





































Figure 3: Filters learned on the Extended Yale B dataset. (a) 11
filters in the first stage; (b) there are 11 groups in the second
stage, and each group contains 8 filters, shown in a column.


4.2 Face Recognition on the Extended Yale B Dataset

The Extended Yale B dataset contains 2414 frontal-face images of
38 individuals. The cropped and normalized 192 × 168 face images
were captured under various lighting conditions. For each subject,
we randomly select 5 images as our testing images and use the rest
for training. A validation set of 5 images per subject was taken
out of the training set. The hyper-parameters were selected to
maximize the performance on the validation set. Then, the system
was trained over the entire training set. In the end the patch
size was set to 5 × 5, and the numbers of filters in the first and
second stages were set to 11 and 8 respectively. The patch
sampling interval was set to 1. The max pooling module used a
2 × 2 boxcar filter with a 2 × 2 down-sampling step. We used
non-overlapping blocks in the output stage, and the block size was
set to 8 × 8. An identity matrix was used as the indexing matrix
in the second stage. We achieve an average accuracy of 99.58% over
10 experiments. As shown in Table 4, the training time of our
method, including PCN plus SVM, is 2054.42 s, and the testing time
per sample is 0.27 s. This is much more efficient than PCANet,
which requires 8551.75 s and 1.39 s respectively. The filters in
the first stage are shown in Figure 3(a); each filter in the first
stage captures direction-related features of an input face image.
Each column in Figure 3(b) contains the filters of one group in
the second stage. The filter banks in different groups are similar
to a large extent, but there are still some differences, so we
cannot simply use the same filters in different groups. We found
that an identity matrix worked better than other indexing
matrices. This may be caused by a blurring effect, since all
subsets in one group are connected to the same filter bank; we may
therefore use different filters within a group in the future.


Table 5: Comparison of accuracies (%) on the CUReT dataset.

Methods           Accuracy
Textons           98.50
BIF               98.60
Histogram         99.00
ScatNet-2(PCA)    99.80
PCANet-2          99.61
PCN-2             99.71


4.3 Texture Classification on CUReT Dataset 


The CUReT texture dataset contains 61 categories of textures. Each
category contains images of the same material under different pose
and illumination conditions. In this experiment, following PCANet
[Chan et al., 2014], a subset of the original data with azimuthal
viewing angles of less than 60 degrees was selected, thereby
yielding 92 images in each class. A central 200 × 200 region was
cropped from each of the selected images. The dataset was randomly
split into a training and a testing set, with 46 training images
for each class. The hyper-parameters were selected according to
the literature. The filter size was set to 5 × 5; the patch
sampling interval was set to 1. The number of filters in both
stages was set to 8, and the non-overlapping block size was
50 × 50. The pooling layer was disabled in each extraction stage.
An identity matrix was used as the indexing matrix in the second
stage. The accuracy reached 99.71%, which was higher than the
99.61% achieved by PCANet.


4.4 Texture Classification on Outex Dataset 

Outex is a framework for the empirical evaluation of texture
classification and segmentation algorithms. Problems are
encapsulated into well-defined test suites with precise
specifications of the input and output data. The Outex database
contains surface textures and natural scenes. The collection of
surface textures is expanding continuously; at the time of writing
the database contains 320 surface textures, both macrotextures and
microtextures. Many textures have variations in local color
content, which results in challenging local gray-scale variations
in the intensity images. Some of the source textures have a large
tactile dimension, which can induce considerable local gray-scale
distortions. Each source texture is imaged according to a fixed
procedure. The images used in a texture classification suite are
extracted from the given set of source images (particular texture
classes, illuminations, spatial resolutions and rotation angles)
by centering the sampling grid so that equally many pixels are
left over on each side of the grid. If the training and testing
images of a particular texture classification problem are
extracted from the same set of source images, the images are
divided randomly into two halves of equal size in order to obtain
an unbiased performance estimate. The directory "images" in each
test suite contains the images needed by the suite. Each directory
indexed by three digits contains one specified problem of the test
suite, defined by three files: classes.txt, test.txt and
train.txt. The problem indexed by 000 of the Outex_TC_00000 test
suite was selected in our experiment. After several validation
trials,


























Figure 4: Example samples of our texture dataset. 


the patch size was set to 5 × 5. The patch sampling interval was
set to 1, and the numbers of filters in the first and second
stages were set to 18 and 6 respectively. The pooling layer was
disabled in each stage. The block size was set to 14 × 14 and the
block overlapping ratio was 0.5. An identity matrix was used as
the indexing matrix in the second stage. We achieve a
classification accuracy of 99.91%; the training time, including
PCN and SVM, is 260.84 s, and the test time per sample is 0.15 s.

4.5 Texture Classification on Our Dataset 

Procedural models are widely used in computer graphics for
generating realistic, natural-looking textures. A number of
procedural models have been proposed, and these models can produce
a wide variety of textures. Through rendering, these textures are
presented as surface images. Given a surface image, it is
important to know which model can produce that kind of texture.
This is a typical texture classification problem. Our procedural
texture dataset contains textures generated by 23 procedural
texture models and rendered by Luxrender under fixed lighting
conditions. Textures generated by one method are normally
different from those generated by other methods; however, some
textures produced by different models may be perceived as similar.
This forms a challenging classification task. Figure 4 shows
example samples of our texture dataset.

The size of the surface images in our dataset is 256 × 256. In
this experiment, we use a total of 3600 surface images, which will
be made available together with the source code in the near
future.

We randomly choose 25% of the images from each method as our
testing set, and the rest are used for training. A validation set
of 5 samples per method was taken out of the training set. The
hyper-parameters were selected to maximize the performance on the
validation set. The best configuration we found uses a patch size
of 7 × 7, a patch sampling interval of 3, and $L_1 = 16$ and
$L_2 = 38$ filters in the first and second extraction stages. A
2 × 2 boxcar filter with a 2 × 2 down-sampling step was used in
the pooling layer. In this experiment the nonlinear output stage
was removed: all feature maps from the feature extraction phase
were reshaped and concatenated to form a vector that serves as the
input to the linear SVM classifier. Then the PCN was trained


Table 6: Comparison of different methods on our texture dataset.

                            PCN-2      PCANet-2
Accuracy (%)                99.89      99.62
Train time (s)              251.80     16407.50
Test time per sample (s)    0.1136     3.14


over the entire training set using the best configuration. The
accuracy reaches 99.89%, which is higher than the 99.62% achieved
by PCANet. These results are shown in Table 6. More importantly,
our algorithm is much more efficient than PCANet in terms of
computation cost: training is about 65 times faster (251.80 s
versus 16407.50 s) and testing is about 28 times faster per sample
(0.1136 s versus 3.14 s). The large numbers of filters suggest
that the surface images in our dataset contain complex structure.

Figure 5(a) shows that most filters in the first stage extract
orientation-related features of the input images. Due to the
complex structure of our textures, some filters look complicated;
indeed, some surface images have no obvious edge information. In
Figure 5(b), each row contains the filters of one group in the
second stage. It can be observed that the leading filters in a
group extract large-scale features, while the later filters
extract more detailed features.

As a comparison, we also applied a traditional CNN to the same
classification task with the same computational resources. After
running for 10 hours and 50000 iterations, it only achieved an
accuracy of 43.2%. The performance becomes worse as the number of
iterations increases, which suggests that the CNN overfits because
we do not have enough training samples.

5 Conclusion 

We propose a PCA-based Convolutional Network (PCN), which
combines the advantages of both CNN [Jarrett et al., 2009] and
PCANet [Chan et al., 2014], i.e. it achieves performance
competitive with state-of-the-art methods but is much more
efficient in terms of computation. The PCN used in our experiments
simply comprises two feature extraction stages and a nonlinear
output stage. Instead of training the network with iterative
methods, PCN simply uses PCA to learn the filters in each
convolution layer. The eigenvectors are used as the filters to
convolve with the input images.

As with other deep networks, a proper configuration of PCN is very
important for different types of inputs. If the training images
are relatively simple in terms of structure and have a large size,
we can use a relatively large interval to sample the patches and
enable the pooling layer to rapidly reduce the feature dimension
of the input image. On the other hand, if the input image is small
enough, we may simply set the patch sampling interval to one and
disable the pooling layer. In the grouping process, all subsets in
one group are connected to the same filter bank, so we can add up
all the subsets to form a new subset. However, different filter
banks may work more effectively. We will consider using different
filter banks within one group in the future.
Figure 5: Filters at the first (a) and second (b) stages on our
texture dataset. There are 16 filters in the first stage. There
are 16 groups in the second stage and every group contains 38
filters, which are shown in a row.